Short title: Automated retrieval of ecological networks data

Keywords: R, API, database, open data, ecological networks, species interactions



The study of ecological networks is severely limited by (i) the difficulty of accessing data, (ii) the lack of a standardized way to link meta-data with interactions, and (iii) the disparity of formats in which ecological networks themselves are represented. To overcome these limitations, we conceived a data specification for ecological networks. We implemented a database respecting this standard, and released an R package (rmangal) allowing users to programmatically access, curate, and deposit data on ecological interactions. In this article, we show how these tools, in conjunction with other frameworks for the programmatic manipulation of open ecological data, streamline the analysis process and improve the replicability and reproducibility of ecological network studies.
Introduction

Ecological networks enable ecologists to accommodate the complexity of natural communities, and to discover mechanisms contributing to their persistence, stability, resilience, and functioning. Most of the "early" studies of ecological networks focused on understanding how the structure of interactions within one location affected the ecological properties of this local community. Such analyses revealed the contribution of 'average' network properties, such as the buffering impact of modularity on species loss (Pimm et al. 1991), the increase in robustness to extinctions along with increases in connectance (Dunne et al. 2002), and the fact that the organization of interactions maximizes biodiversity (Bastolla et al. 2009). More recently, new studies introduced the idea that networks can vary from one realization to another. They can be meaningfully compared, either to understand the importance of environmental gradients on the realization of ecological interactions (Tylianakis et al. 2007), or to understand the mechanisms behind variation in the structure of ecological networks (Poisot et al. 2012). Yet, meta-analyses of a large number of ecological networks are still extremely rare, and most of the studies comparing several networks do so within the limits of particular systems (Schleuning et al. 2011; Dalsgaard et al. 2013). The severe shortage of data in the field also restricts the scope of large-scale analyses.

An increasing number of approaches are being put forth to predict the structure of ecological networks, relying either on latent variables (Rohr et al. 2010) or on actual traits (Gravel et al. 2013). To be adequately calibrated, such approaches require easily accessible data. Comparing the efficiency of different methods is also easier if there is a homogeneous way of representing ecological interactions and the associated meta-data. In this paper, we (i) establish the need for a data specification serving as a lingua franca among network ecologists, (ii) describe this data specification, and (iii) describe rmangal, an R package and companion database relying on this data specification. The rmangal package makes it easy to retrieve ecological interaction network data from a database, and also to deposit them. We provide some use cases showing how this new approach makes complex analyses simpler, and allows for the integration of new tools to manipulate biodiversity resources.

Networks need a data specification

Ecological networks are (often) stored as an adjacency matrix (or as the quantitative link matrix), that is, a series of 0s and 1s indicating, respectively, the absence and presence of an interaction. This format is extremely convenient for use (as most network analysis packages, e.g. bipartite, betalink, foodweb, require data to be presented this way), but is extremely inefficient at storing meta-data. In most cases, an adjacency matrix informs on the identity of species (in cases where row and column headers are present), and the presence or absence of interactions. If other data about the environment (e.g. where the network was sampled) or the species (e.g. the population size, trait distribution, or other observations) are available, they are most often given either in other files or as accompanying text. In both cases, making a programmatic link between interaction data and relevant meta-data is difficult and error-prone.
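
To make this limitation concrete, the following Python sketch (an illustration only, not code from rmangal) flattens an adjacency matrix into one record per interaction, so that sampling meta-data can travel with each interaction. The species names, the coordinates, and the record keys are all hypothetical.

```python
# Hypothetical species and sampling site; none of these values come from a real dataset.
species = ["plant_a", "plant_b", "pollinator_x"]
adjacency = [
    [0, 0, 1],  # plant_a interacts with pollinator_x
    [0, 0, 1],  # plant_b interacts with pollinator_x
    [0, 0, 0],
]
site = {"latitude": 48.45, "longitude": -68.52}


def matrix_to_records(names, matrix, metadata):
    """Return one dictionary per non-zero cell, each carrying the shared meta-data."""
    records = []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value:
                records.append({"from": names[i], "to": names[j], **metadata})
    return records


records = matrix_to_records(species, adjacency, site)
```

The matrix alone has no slot for the coordinates; once each interaction is a record, any number of meta-data fields can be attached without a side file.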

By contrast, a data specification (i.e. a set of precise instructions detailing how each object should be represented) provides a common language for network ecologists to interact, and ensures that, regardless of their source, data can be used in a shared workflow. Most importantly, a data specification describes how data are exchanged. Each group retains the ability to store the data in the format that is most convenient for in-house use, and only needs to provide export options (e.g. through an API, i.e. a programmatic interface running on a web server, returning data in response to queries in a predetermined language) respecting the data specification. This approach ensures that all data can be used in meta-analyses, and increases the impact of data (Piwowar et al. 2007; Piwowar & Vision 2013).

Elements of the data specification

The data specification (Fig. 1) is built around the idea that (ecological) networks are collections of relationships between ecological objects, each element having particular meta-data associated. In this section, we detail the way networks are represented in the mangal specification. An interactive webpage with the elements of the data specification can be found online at http://mangal.uqar.ca/doc/spec/. The data specification is available at the API root (e.g. http://mangal.uqar.ca/api/v1/?format=json), and can also be viewed using the whatIs function from the R package (see Supp. Mat. 1). Rather than giving an exhaustive list of the elements of the data specification (which is available online at the aforementioned URL), this section serves as an overview of each element, and of how they interact.

We propose JSON, a format equivalent to XML, as an efficient way to standardize data representation, for three main reasons. First, it has emerged as a de facto standard for web platforms serving data and accepting data from users. Second, it allows validation of the data: a JSON file can be matched against a schema, and one can verify that it is correctly formatted (this includes the possibility that not all fields are filled, as this will depend on available data). Third, JSON objects are easily and cheaply (memory-wise) parsed in the most common programming languages, notably R (equivalent to list) and Python (equivalent to dict). For most users, the format in which data are transmitted is unimportant, as the interaction happens within R; knowing how JSON objects are organized is only useful for those who want to interact with the API directly. The rmangal package takes care of converting the data into the correct JSON format to upload them to the database.
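
As a rough illustration of parsing and validation, the Python sketch below loads a JSON string into a dict and checks it against a hand-written list of fields. This is a simplified stand-in for real schema validation, not the mangal schema itself: the TAXA_FIELDS table, the validate helper, and the choice of which fields are required are assumptions made for the example.

```python
import json

# Hypothetical, simplified field table: field name -> (expected type, required?).
# The actual mangal schema may use different names, types, and constraints.
TAXA_FIELDS = {
    "name": (str, True),
    "vernacular": (str, False),
    "gbif": (int, False),
}


def validate(obj, fields):
    """Return a list of human-readable problems; an empty list means the object passes."""
    errors = []
    for key, (expected, required) in fields.items():
        if key not in obj:
            if required:
                errors.append(f"missing required field: {key}")
        elif not isinstance(obj[key], expected):
            errors.append(f"wrong type for field: {key}")
    for key in obj:
        if key not in fields:
            errors.append(f"unknown field: {key}")
    return errors


# json.loads turns a JSON object into a Python dict.
taxon = json.loads('{"name": "Amphiprion clarkii", "vernacular": "yellowtail clownfish"}')
incomplete = json.loads('{"vernacular": "yellowtail clownfish"}')

problems_ok = validate(taxon, TAXA_FIELDS)
problems_bad = validate(incomplete, TAXA_FIELDS)
```

Optional fields may be left out without triggering an error, which mirrors the point that not all fields need to be filled.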
Fig. 1: An overview of the data specification, and the hierarchy between objects. Each box corresponds to a level of the data specification. Grey boxes are nodes, blue boxes are interactions and networks, and green boxes are meta-data. The bold boxes (dataset, network, interaction, taxa) are the minimal elements needed to represent a network.
Node information

Taxa

A taxa object is a taxonomic entity of any level, identified by its name, vernacular name, and identifiers in a variety of taxonomic services. Associating the identifiers of each taxon is important to leverage the power of the new generation of open data tools, such as taxize (Chamberlain & Szöcs 2013). The data specification currently has fields for ncbi, gbif, itis, eol and bold identifiers. We also provide the taxonomic status, i.e. whether a taxon is a true taxonomic entity, a "trophic species", or a morphospecies.

Population

A population is one observed instance of a taxa object. If your experimental design is replicated through space, then each taxon has a population object corresponding to each locality. Populations do not have associated meta-data, but serve as "containers" for item objects.

Item

An item is an instance of a population. Items have a level argument, which can be either individual or population; this makes it possible to represent both individual-level networks (i.e. there are as many items attached to a population as there were individuals of this population sampled) and population-level networks. When an item represents a population, it is possible to give a measure of the size of this population. The notion of item is particularly useful for time-replicated designs: each observation of a population at a time-point is an item with associated trait values, and possibly population size.
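
The taxa, population, and item hierarchy described above can be sketched with Python dataclasses. This is one possible in-memory reading of the prose, not the mangal schema: the class layout and the validation rules in __post_init__ are assumptions, and the population sizes are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Taxa:
    name: str
    vernacular: Optional[str] = None


@dataclass
class Population:
    taxa: Taxa  # a population is one observed instance of a taxa object


@dataclass
class Item:
    population: Population
    level: str  # "individual" or "population"
    size: Optional[int] = None  # only meaningful when level == "population"

    def __post_init__(self):
        if self.level not in ("individual", "population"):
            raise ValueError("level must be 'individual' or 'population'")
        if self.level == "individual" and self.size is not None:
            raise ValueError("size applies to population-level items only")


clownfish = Taxa(name="Amphiprion clarkii")
site_a = Population(taxa=clownfish)
# Time-replicated design: one population-level item per sampling occasion
# (the sizes 12 and 9 are invented for the example).
surveys = [Item(site_a, "population", size=n) for n in (12, 9)]
```

Because the population carries no meta-data of its own, all observations (size here, trait values in a fuller model) hang off the items.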

Network information

Interaction

An interaction links, at a minimum, two taxa objects (but can also link pairs of populations or items). The most important attributes of an interaction are its type (of which we provide a list of possible values, see Supp. Mat. 1) and its nature, i.e. how it was observed. This field helps differentiate direct observations, text mining, and inference. Note that the nature field can also take absence as a value; this is useful for, e.g., "cafeteria" experiments in which there is high confidence that the interaction did not happen.
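
The distinction between a confirmed absence and a link that was simply never recorded is lost in a 0/1 matrix. The Python sketch below shows one way to keep it when converting interaction records back to an adjacency matrix. The key names (taxa_from, taxa_to) and the value "observation" are hypothetical; only the value absence is named in the specification text above.

```python
# Hypothetical interaction records; key names and the "observation" label are assumptions.
interactions = [
    {"taxa_from": "predator", "taxa_to": "prey_1", "nature": "observation"},
    {"taxa_from": "predator", "taxa_to": "prey_2", "nature": "absence"},
]


def to_adjacency(records, names):
    """Build a 0/1 matrix, skipping records whose nature is 'absence'."""
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for record in records:
        if record["nature"] == "absence":
            continue  # a confirmed absence is not a link
        matrix[index[record["taxa_from"]]][index[record["taxa_to"]]] = 1
    return matrix


def confirmed_absences(records):
    """Return the pairs for which the interaction is known not to occur."""
    return {(r["taxa_from"], r["taxa_to"]) for r in records if r["nature"] == "absence"}


names = ["predator", "prey_1", "prey_2"]
matrix = to_adjacency(interactions, names)
absences = confirmed_absences(interactions)
```

In the matrix, the predator and prey_2 cell is 0 just like every unobserved pair; the separate set is what preserves the experimental knowledge.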
Network

A network is a series of interaction objects, along with (i) information on its spatial position (provided as latitude and longitude), (ii) the date of sampling, and (iii) references to measures of environmental conditions.

Dataset

A dataset is a collection of one or several network(s). Datasets also have fields for data and papers, both of which are references to bibliographic or web resources describing, respectively, the source of the data and the papers in which these data have been significantly used. Datasets are the preferred entry point into the resources.

Meta-data

Trait value

Objects of type item can have associated trait values. These consist of the description of the trait being measured, the value, and the units in which the measure was taken.

Environmental condition

Environmental conditions are associated with dataset, network, and interaction objects, to allow for both macro- and micro-environmental conditions. These are defined by the environmental property measured, its value, and the units.

References

References are associated with datasets. They accommodate DOI, JSTOR or PubMed identifiers, or a URL. When possible, the DOI should be preferred as it offers more potential to interact with other online tools, such as the CrossRef API.

Use cases

In this section, we present use cases using the rmangal package for R to interact with a database implementing this data specification and serving data through an API (http://mangal.uqar.ca/api/v1/). Users can deposit data into this database through the R package. Data are made available under a CC-0 Waiver (Poisot et al. 2013). Detailed information about how to upload data is given in the vignettes and manual of the rmangal package. To save room in the manuscript, we source each example; the complete R files to reproduce the examples of this section are attached as Suppl. Mat.

The data we use for this example come from Ricciardi et al. (2010). These were previously available on the Interaction Web DataBase as a single xls file. We uploaded them to the mangal database at http://mangal.uqar.ca/api/v1/dataset/1.

Link-species relationships

In the first example, we visualize the relationship between the number of species and the number of interactions, which Martinez (1992) proposed to be linear (in food webs).

source("usecases/1_ls.r")

Producing this figure requires fewer than 10 lines of code. The only information needed is the identifier of the network or dataset, which we suggest should be reported in publications as: "These data were deposited in the mangal format at <URL>/api/v1/dataset/<ID>", possibly in the acknowledgements. To encourage data sharing, we ask users of the database to cite the original dataset or publication.
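
The two quantities plotted in this use case are simple to derive once each network is available as a list of interactions. The Python sketch below is a stand-alone illustration of that counting step, not the content of the sourced R file; the networks dictionary is a hypothetical stand-in for data retrieved from the database.

```python
# Hypothetical networks, each given as a list of (taxon, taxon) interactions.
networks = {
    "site_1": [("a", "x"), ("b", "x")],
    "site_2": [("a", "x"), ("a", "y"), ("b", "y"), ("c", "y")],
}


def species_and_links(edges):
    """Return (number of species, number of distinct interactions) for one network."""
    species = set()
    for source, target in edges:
        species.update((source, target))
    return len(species), len(set(edges))


counts = {name: species_and_links(edges) for name, edges in networks.items()}
```

Plotting the second element of each pair against the first gives the link-species relationship for the dataset.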

Network beta-diversity

In the second example, we use the framework of network β-diversity (Poisot et al. 2012) to measure the extent to which networks that are far apart in space have different interactions. Each network in the dataset has a latitude and longitude, meaning that it is possible to measure the geographic distance between two networks.

For each pair of networks, we measure the geographic distance (in km), the species dissimilarity (β_S), the network dissimilarity when all species are present (β_WN), and finally, the network dissimilarity when only shared species are considered (β_OS).

source("usecases/2_beta.r")

As shown in Fig. XX, while species dissimilarity and overall network dissimilarity increase when two networks are far apart, this is not the case for the way common species interact. This suggests that in this system, network dissimilarity over space is primarily driven by species turnover. The ease of gathering both raw interaction data and associated meta-data makes producing this analysis extremely straightforward.
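
For readers who want to see what the pairwise measures involve, here is a minimal Python sketch under stated assumptions: geographic distance is computed with the haversine formula on a sphere of radius 6371 km, and dissimilarity uses Whittaker's measure, (a + b + c) / ((2a + b + c) / 2) - 1, where a counts shared elements and b and c count elements unique to each network. Other dissimilarity measures could be substituted, and this is not the content of the sourced R file. The two example networks are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * radius_km * asin(sqrt(h))


def whittaker(x, y):
    """Whittaker dissimilarity between two sets (0 = identical, 1 = disjoint)."""
    a, b, c = len(x & y), len(x - y), len(y - x)
    if a + b + c == 0:
        return 0.0  # two empty sets are treated as identical
    return (a + b + c) / ((2 * a + b + c) / 2) - 1


def species_of(edges):
    """Set of all taxa appearing in a set of (taxon, taxon) interactions."""
    return {taxon for edge in edges for taxon in edge}


def network_dissimilarity(n1, n2):
    """Species (S), whole-network (WN), and shared-species (OS) dissimilarity."""
    s1, s2 = species_of(n1), species_of(n2)
    shared = s1 & s2
    # Keep only interactions whose two partners occur in both networks.
    os1 = {edge for edge in n1 if set(edge) <= shared}
    os2 = {edge for edge in n2 if set(edge) <= shared}
    return {"S": whittaker(s1, s2), "WN": whittaker(n1, n2), "OS": whittaker(os1, os2)}


# Hypothetical networks: they share taxa a and x, and interact identically there.
n1 = {("a", "x"), ("b", "x")}
n2 = {("a", "x"), ("c", "x")}
beta = network_dissimilarity(n1, n2)
```

In this toy pair, the species and whole-network values are positive while the shared-species value is zero, which is the pattern the text attributes to species turnover.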
Fig. 2: Relationship between the number of species and number of interactions in the anemonefish-fish dataset.

Fig. 3: Relationships between the geographic distance between two sites, and the species dissimilarity, network dissimilarity with all, and only shared, species.
Spatial visualization of networks

Bascompte (2009) uses an interesting visualization for spatial networks, in which each species is laid out on a map at the center of mass of its distribution; interactions are then drawn between species to show how species distributions determine biotic interactions. In this final use case, we reproduce a similar figure.

source("usecases/3_spatial.r")
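
The layout step can be approximated as follows. This Python sketch assumes that a species' distribution is summarized by the coordinates of the sites where it was recorded, and takes the arithmetic mean of those coordinates as its center of mass. That is a simplification (it ignores the Earth's curvature and breaks down across the antimeridian), and the occurrence data are hypothetical.

```python
def center_of_mass(points):
    """Arithmetic mean of a non-empty list of (latitude, longitude) pairs."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return sum(lats) / len(lats), sum(lons) / len(lons)


# Hypothetical occurrences: species name -> coordinates of the sites where it was seen.
occurrences = {
    "species_a": [(10.0, 120.0), (12.0, 124.0)],
    "species_b": [(-5.0, 150.0)],
}

positions = {name: center_of_mass(points) for name, points in occurrences.items()}
```

Each species is then drawn at its entry in positions, and a segment is added between the positions of every pair of interacting species.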




Fig. 4: Spatial plot of a network, using the maps and rmangal packages. The circles in the inset map show the locations of the sites. Each dot in the main map represents a species, with interactions drawn between them.
Conclusions

In this contribution, we presented mangal, a data format for the exchange of ecological networks and associated meta-data. We deployed an online database with an associated API, relying on this data specification. Finally, we introduced rmangal, an R package designed to interact with APIs using the mangal format. We expect that the data specification will evolve based on the needs of the community. At the moment, users are welcome to propose such changes on the project issue page: https://github.com/mangal-wg/mangal-schemes/issues. A Python wrapper for the API is also available at http://github.com/mangal-wg/pymangal/.